{
"cells": [
{
"cell_type": "markdown",
"id": "2ec566f5-ff28-43b8-b9d5-287a906ff9e6",
"metadata": {},
"source": [
"# Movie Genre Classification"
]
},
{
"cell_type": "markdown",
"id": "ff2351d9-7d15-4c4c-bf62-d4b300cfcadc",
"metadata": {},
"source": [
"Create a machine learning model that can predict the genre of a movie based on its plot summary or other textual information. You can use techniques like TF-IDF or word embeddings with classifiers such as Naive Bayes, Logistic Regression, or Support Vector Machines."
]
},
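{
"cell_type": "markdown",
"id": "sketch-tfidf-pipeline-md",
"metadata": {},
"source": [
"A minimal sketch of the approach described above, assuming scikit-learn is installed. The four plot summaries and their labels are invented for illustration and are not taken from the dataset. `make_pipeline` chains TF-IDF with Logistic Regression so that the vectorizer is fitted only on the texts passed to `fit`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-tfidf-pipeline-code",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.pipeline import make_pipeline\n",
"\n",
"# Toy plot summaries (hypothetical, for illustration only)\n",
"toy_plots = [\n",
"    \"a detective hunts a killer through the city\",\n",
"    \"a killer stalks a detective in a dark house\",\n",
"    \"two friends plan a wedding and fall in love\",\n",
"    \"a wedding goes wrong and friends laugh about love\",\n",
"]\n",
"toy_genres = [\"thriller\", \"thriller\", \"comedy\", \"comedy\"]\n",
"\n",
"# TF-IDF features feed directly into the classifier\n",
"toy_model = make_pipeline(TfidfVectorizer(), LogisticRegression())\n",
"toy_model.fit(toy_plots, toy_genres)\n",
"\n",
"toy_pred = toy_model.predict([\"friends laugh at a wedding\"])\n",
"print(toy_pred)"
]
},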
{
"cell_type": "markdown",
"id": "7470562b-4b03-4cae-9ff1-6da8eabbb032",
"metadata": {},
"source": [
"### Import necessary libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "c75d012f-5f54-4d71-857b-d29ba4b77c90",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import nltk\n",
"import string\n",
"import re\n",
"%matplotlib inline\n",
"\n",
"from nltk.corpus import stopwords\n",
"from nltk.stem import LancasterStemmer\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.svm import SVC\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"from sklearn.linear_model import LogisticRegression"
]
},
{
"cell_type": "markdown",
"id": "e277bcf2-81d5-41f3-a43c-7dbbffba78fe",
"metadata": {},
"source": [
"### Load Train Dataset"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "68f24b61-c5bf-4d72-b45e-1ec55e58632c",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\kiran\\AppData\\Local\\Temp\\ipykernel_12232\\595940393.py:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.\n",
" train_data = pd.read_csv(train_path, sep = ':::', names = ['Title', 'Genre', 'Description'])\n"
]
},
{
"data": {
"text/plain": [
" Title Genre \\\n",
"1 Oscar et la dame rose (2009) drama \n",
"2 Cupid (1997) thriller \n",
"3 Young, Wild and Wonderful (1980) adult \n",
"4 The Secret Sin (1915) drama \n",
"5 The Unrecovered (2007) drama \n",
"\n",
" Description \n",
"1 Listening in to a conversation between his do... \n",
"2 A brother and sister with a past incestuous r... \n",
"3 As the bus empties the students for their fie... \n",
"4 To help their unemployed father make ends mee... \n",
"5 The film's title refers not only to the un-re... "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_path = ('Downloads/archive/Genre Classification Dataset/train_data.txt')\n",
"train_data = pd.read_csv(train_path, sep = ':::', engine = 'python', names = ['Title', 'Genre', 'Description'])\n",
"train_data.head()"
]
},
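{
"cell_type": "markdown",
"id": "sketch-read-separator-md",
"metadata": {},
"source": [
"The `:::` separator is longer than one character, so pandas treats it as a regular expression and needs the Python parsing engine. A small sketch on an in-memory sample follows. The two rows are abbreviated from the preview above. The leading id field and the `\\s*:::\\s*` pattern, which also strips the padding spaces around each field, are assumptions about the file layout rather than something this notebook does."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-read-separator-code",
"metadata": {},
"outputs": [],
"source": [
"import io\n",
"import pandas as pd\n",
"\n",
"# Two abbreviated rows in the assumed 'id ::: title ::: genre ::: description' layout\n",
"sample = io.StringIO(\n",
"    \"1 ::: Oscar et la dame rose (2009) ::: drama ::: Listening in to a conversation ...\\n\"\n",
"    \"2 ::: Cupid (1997) ::: thriller ::: A brother and sister ...\\n\"\n",
")\n",
"\n",
"sample_df = pd.read_csv(\n",
"    sample,\n",
"    sep=r\"\\s*:::\\s*\",   # regex separator that also swallows the surrounding spaces\n",
"    engine=\"python\",    # regex separators require the Python engine\n",
"    names=[\"id\", \"Title\", \"Genre\", \"Description\"],\n",
")\n",
"print(sample_df[\"Genre\"].tolist())"
]
},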
{
"cell_type": "markdown",
"id": "e00f21d2-820b-41cd-9794-7ec015dab8fd",
"metadata": {},
"source": [
"### Load Test Dataset"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "b2f19cb0-f4d8-4faf-9ffc-892aaedcb9d1",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\kiran\\AppData\\Local\\Temp\\ipykernel_12232\\1819879311.py:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.\n",
" test_data = pd.read_csv(test_path, sep = ':::', names = ['id', 'Title', 'Description'])\n"
]
},
{
"data": {
"text/plain": [
" id Title \\\n",
"0 1 Edgar's Lunch (1998) \n",
"1 2 La guerra de papá (1977) \n",
"2 3 Off the Beaten Track (2010) \n",
"3 4 Meu Amigo Hindu (2015) \n",
"4 5 Er nu zhai (1955) \n",
"\n",
" Description \n",
"0 L.R. Brane loves his life - his car, his apar... \n",
"1 Spain, March 1964: Quico is a very naughty ch... \n",
"2 One year in the life of Albin and his family ... \n",
"3 His father has died, he hasn't spoken with hi... \n",
"4 Before he was known internationally as a mart... "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_path = ('Downloads/archive/Genre Classification Dataset/test_data.txt')\n",
"test_data = pd.read_csv(test_path, sep = ':::', engine = 'python', names = ['id', 'Title', 'Description'])\n",
"test_data.head()"
]
},
{
"cell_type": "markdown",
"id": "8d81a971-2358-4dfe-8f05-212028bde730",
"metadata": {},
"source": [
"### Load Target Dataset"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "8b33a722-ceb7-42cb-ae77-5f89816093eb",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\kiran\\AppData\\Local\\Temp\\ipykernel_12232\\320415847.py:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.\n",
" test_soln_data = pd.read_csv(test_soln_path, sep = ':::', names = ['Title', 'Genre', 'Description'])\n"
]
},
{
"data": {
"text/plain": [
" Target_Genre\n",
"1 thriller \n",
"2 comedy \n",
"3 documentary \n",
"4 drama \n",
"5 drama "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_soln_path = ('Downloads/archive/Genre Classification Dataset/test_data_solution.txt')\n",
"test_soln_data = pd.read_csv(test_soln_path, sep = ':::', engine = 'python', names = ['Title', 'Genre', 'Description'])\n",
"test_soln_data.drop(test_soln_data.columns[[0,2]], axis = 1, inplace = True)\n",
"test_soln_data.rename(columns = {'Genre':'Target_Genre'}, inplace = True)\n",
"test_soln_data.head()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "e4742eb5-1b02-4ad0-8e13-e22b5a8c7909",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Title Genre \\\n",
"count 54214 54214 \n",
"unique 54214 27 \n",
"top Oscar et la dame rose (2009) drama \n",
"freq 1 13613 \n",
"\n",
" Description \n",
"count 54214 \n",
"unique 54086 \n",
"top Grammy - music award of the American academy ... \n",
"freq 12 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_data.describe()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "52568b32-6358-450f-bfd7-bbcd17e097bb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Index: 54214 entries, 1 to 54214\n",
"Data columns (total 3 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Title 54214 non-null object\n",
" 1 Genre 54214 non-null object\n",
" 2 Description 54214 non-null object\n",
"dtypes: object(3)\n",
"memory usage: 1.7+ MB\n"
]
}
],
"source": [
"train_data.info()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "d3334ee4-9ebe-4559-868f-9a473a5a5e62",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" id\n",
"count 54200.000000\n",
"mean 27100.500000\n",
"std 15646.336632\n",
"min 1.000000\n",
"25% 13550.750000\n",
"50% 27100.500000\n",
"75% 40650.250000\n",
"max 54200.000000"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_data.describe()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "97f1c149-33ee-4f3c-b138-53debccf05e5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 54200 entries, 0 to 54199\n",
"Data columns (total 3 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 id 54200 non-null int64 \n",
" 1 Title 54200 non-null object\n",
" 2 Description 54200 non-null object\n",
"dtypes: int64(1), object(2)\n",
"memory usage: 1.2+ MB\n"
]
}
],
"source": [
"test_data.info()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "173310ae-b63a-4598-8577-590723d79042",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Title 0\n",
"Genre 0\n",
"Description 0\n",
"dtype: int64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_data.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "b12040c8-6b65-4106-aee8-ba0143a0407d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Genre\n",
" drama 13613\n",
" documentary 13096\n",
" comedy 7447\n",
" short 5073\n",
" horror 2204\n",
" thriller 1591\n",
" action 1315\n",
" western 1032\n",
" reality-tv 884\n",
" family 784\n",
" adventure 775\n",
" music 731\n",
" romance 672\n",
" sci-fi 647\n",
" adult 590\n",
" crime 505\n",
" animation 498\n",
" sport 432\n",
" talk-show 391\n",
" fantasy 323\n",
" mystery 319\n",
" musical 277\n",
" biography 265\n",
" history 243\n",
" game-show 194\n",
" news 181\n",
" war 132\n",
"Name: count, dtype: int64"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"counts = train_data.Genre.value_counts()\n",
"counts"
]
},
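{
"cell_type": "markdown",
"id": "sketch-majority-baseline-md",
"metadata": {},
"source": [
"The counts above are heavily skewed: drama and documentary together account for about half of the training set. As a rough reference point, the sketch below computes the training-set accuracy of a classifier that always predicts the most frequent genre, using the counts printed above. The variable names are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-majority-baseline-code",
"metadata": {},
"outputs": [],
"source": [
"n_total = 54214          # rows in the training set (from train_data.info())\n",
"n_drama = 13613          # most frequent genre\n",
"n_documentary = 13096    # second most frequent genre\n",
"\n",
"# Accuracy of always predicting 'drama' on the training set\n",
"majority_baseline = n_drama / n_total\n",
"# Share of the training set covered by the two largest genres\n",
"top_two_share = (n_drama + n_documentary) / n_total\n",
"\n",
"print(f\"majority-class baseline accuracy: {majority_baseline:.3f}\")\n",
"print(f\"share of drama + documentary:     {top_two_share:.3f}\")"
]
},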
{
"cell_type": "markdown",
"id": "2841dea6-55c1-4ff0-9b60-9523e87b8222",
"metadata": {},
"source": [
"#### Plotting the counts of Genres in the training dataset"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "1ac0dd26-879b-460a-9df0-5c29e6dbdab6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Figure: horizontal count plot of the number of training examples per genre, ordered by frequency]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize = (10,8))\n",
"sns.countplot(data=train_data, y='Genre', order=counts.index)"
]
},
{
"cell_type": "markdown",
"id": "54321b93-991d-494f-a094-bf61082bb090",
"metadata": {},
"source": [
"#### Plotting the distribution of Genres using a bar plot"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "eae1e83e-649c-45f2-8333-79bd768ecc40",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,\n",
" 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]),\n",
" [Text(0, 0, ' drama '),\n",
" Text(1, 0, ' documentary '),\n",
" Text(2, 0, ' comedy '),\n",
" Text(3, 0, ' short '),\n",
" Text(4, 0, ' horror '),\n",
" Text(5, 0, ' thriller '),\n",
" Text(6, 0, ' action '),\n",
" Text(7, 0, ' western '),\n",
" Text(8, 0, ' reality-tv '),\n",
" Text(9, 0, ' family '),\n",
" Text(10, 0, ' adventure '),\n",
" Text(11, 0, ' music '),\n",
" Text(12, 0, ' romance '),\n",
" Text(13, 0, ' sci-fi '),\n",
" Text(14, 0, ' adult '),\n",
" Text(15, 0, ' crime '),\n",
" Text(16, 0, ' animation '),\n",
" Text(17, 0, ' sport '),\n",
" Text(18, 0, ' talk-show '),\n",
" Text(19, 0, ' fantasy '),\n",
" Text(20, 0, ' mystery '),\n",
" Text(21, 0, ' musical '),\n",
" Text(22, 0, ' biography '),\n",
" Text(23, 0, ' history '),\n",
" Text(24, 0, ' game-show '),\n",
" Text(25, 0, ' news '),\n",
" Text(26, 0, ' war ')])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"[Figure: bar plot of genre counts with x-axis labels rotated 90 degrees]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize = (10,8))\n",
"sns.barplot(x=counts.index, y=counts)\n",
"plt.xticks(rotation=90)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "8f6ed485-e0bb-42b3-95d4-95a34c770b1b",
"metadata": {},
"outputs": [],
"source": [
"stemmer = LancasterStemmer()\n",
"stop_words = set(stopwords.words('english'))\n",
"\n",
"def corpus(text):\n",
"    \"\"\"Clean one description: lowercase, strip handles/URLs/non-letters, drop stopwords and short tokens.\"\"\"\n",
"    text = text.lower()  # Lowercase all characters\n",
"    text = re.sub(r'@\\S+', '', text)  # Remove Twitter handles\n",
"    text = re.sub(r'http\\S+', '', text)  # Remove URLs\n",
"    text = re.sub(r'pic.\\S+', '', text)  # Remove 'pic' + any character + non-space run (e.g. pic.twitter.com links)\n",
"    text = re.sub(r\"[^a-zA-Z+']\", ' ', text)  # Keep only letters, '+' and apostrophes\n",
"    text = re.sub(r'\\s+[a-zA-Z]\\s+', ' ', text + ' ')  # Drop single-letter words\n",
"    text = \"\".join([i for i in text if i not in string.punctuation])  # Strip remaining punctuation\n",
"    words = nltk.word_tokenize(text)\n",
"    # Remove stopwords (using the stop_words set built above) and tokens of length <= 2\n",
"    text = \" \".join([i for i in words if i not in stop_words and len(i) > 2])\n",
"    text = re.sub(r\"\\s[\\s]+\", \" \", text).strip()  # Collapse repeated spaces and trim the ends\n",
"    return text\n",
"\n",
"train_data['Corpus_cleaning'] = train_data['Description'].apply(corpus)\n",
"test_data['Corpus_cleaning'] = test_data['Description'].apply(corpus)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "4d6ddc89-b80e-4706-a0c1-938b7658885b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Title Genre \\\n",
"1 Oscar et la dame rose (2009) drama \n",
"2 Cupid (1997) thriller \n",
"3 Young, Wild and Wonderful (1980) adult \n",
"4 The Secret Sin (1915) drama \n",
"5 The Unrecovered (2007) drama \n",
"... ... ... \n",
"54210 \"Bonino\" (1953) comedy \n",
"54211 Dead Girls Don't Cry (????) horror \n",
"54212 Ronald Goedemondt: Ze bestaan echt (2008) documentary \n",
"54213 Make Your Own Bed (1944) comedy \n",
"54214 Nature's Fury: Storm of the Century (2006) history \n",
"\n",
" Description \\\n",
"1 Listening in to a conversation between his do... \n",
"2 A brother and sister with a past incestuous r... \n",
"3 As the bus empties the students for their fie... \n",
"4 To help their unemployed father make ends mee... \n",
"5 The film's title refers not only to the un-re... \n",
"... ... \n",
"54210 This short-lived NBC live sitcom centered on ... \n",
"54211 The NEXT Generation of EXPLOITATION. The sist... \n",
"54212 Ze bestaan echt, is a stand-up comedy about g... \n",
"54213 Walter and Vivian live in the country and hav... \n",
"54214 On Labor Day Weekend, 1935, the most intense ... \n",
"\n",
" Corpus_cleaning \n",
"1 listening conversation doctor parents year old... \n",
"2 brother sister past incestuous relationship cu... \n",
"3 bus empties students field trip museum natural... \n",
"4 help unemployed father make ends meet edith tw... \n",
"5 films title refers recovered bodies ground zer... \n",
"... ... \n",
"54210 short lived nbc live sitcom centered bonino wo... \n",
"54211 next generation exploitation sisters kapa bay ... \n",
"54212 bestaan echt stand comedy growing facing fears... \n",
"54213 walter vivian live country difficult time keep... \n",
"54214 labor day weekend intense hurricane ever make ... \n",
"\n",
"[54214 rows x 4 columns]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_data"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "67c30bbe-794f-4152-867a-e1b8f7a6e70d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" id Title \\\n",
"0 1 Edgar's Lunch (1998) \n",
"1 2 La guerra de papá (1977) \n",
"2 3 Off the Beaten Track (2010) \n",
"3 4 Meu Amigo Hindu (2015) \n",
"4 5 Er nu zhai (1955) \n",
"... ... ... \n",
"54195 54196 \"Tales of Light & Dark\" (2013) \n",
"54196 54197 Der letzte Mohikaner (1965) \n",
"54197 54198 Oliver Twink (2007) \n",
"54198 54199 Slipstream (1973) \n",
"54199 54200 Curitiba Zero Grau (2010) \n",
"\n",
" Description \\\n",
"0 L.R. Brane loves his life - his car, his apar... \n",
"1 Spain, March 1964: Quico is a very naughty ch... \n",
"2 One year in the life of Albin and his family ... \n",
"3 His father has died, he hasn't spoken with hi... \n",
"4 Before he was known internationally as a mart... \n",
"... ... \n",
"54195 Covering multiple genres, Tales of Light & Da... \n",
"54196 As Alice and Cora Munro attempt to find their... \n",
"54197 A movie 169 years in the making. Oliver Twist... \n",
"54198 Popular, but mysterious rock D.J Mike Mallard... \n",
"54199 Curitiba is a city in movement, with rhythms ... \n",
"\n",
" Corpus_cleaning \n",
"0 brane loves life car apartment job especially ... \n",
"1 spain march quico naughty child three belongin... \n",
"2 one year life albin family shepherds north tra... \n",
"3 father died hasnt spoken brother years serious... \n",
"4 known internationally martial arts superstar b... \n",
"... ... \n",
"54195 covering multiple genres tales light dark anth... \n",
"54196 alice cora munro attempt find father british o... \n",
"54197 movie years making oliver twist artful dodger ... \n",
"54198 popular mysterious rock mike mallard askew bro... \n",
"54199 curitiba city movement rhythms different pulsa... \n",
"\n",
"[54200 rows x 4 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_data"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "8146b3bf-efce-4f4c-8eb3-0185431aff60",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"shape before dropping duplicates (54214, 4)\n",
"shape after dropping duplicates (54214, 4)\n"
]
}
],
"source": [
"print(\"shape before dropping duplicates\", train_data.shape)\n",
"train_data = train_data.drop_duplicates()\n",
"print(\"shape after dropping duplicates\", train_data.shape)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "83d3f5ba-3231-4030-b7b1-a429c87f2066",
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings(\"ignore\", \"use_inf_as_na\")\n",
"train_data['length_Corpus_cleaning'] = train_data['Corpus_cleaning'].apply(len)"
]
},
{
"cell_type": "markdown",
"id": "601a2ed8-6ef2-470a-835b-18ed99ee2db2",
"metadata": {},
"source": [
"#### Visualizing the text length"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "6576f225-4222-498b-a0f3-5e44d0c46ddc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0, 0.5, 'Frequency')"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"[Figure: histogram with KDE of cleaned description length; x-axis 'Length', y-axis 'Frequency']"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sns.histplot(data = train_data, x = 'length_Corpus_cleaning', bins = 20, kde = True)\n",
"plt.xlabel('Length')\n",
"plt.ylabel('Frequency')"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "c3be52c8-02b2-4db8-a547-db438bd0371f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0, 0.5, 'Frequency')"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
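"text/plain": [
"[Figure: side-by-side histograms titled 'Original Text Length' and 'Cleaned Text Length'; x-axis 'Text Length', y-axis 'Frequency']"
]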
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(12, 6))\n",
"# Subplot 1: Original text length distribution\n",
"plt.subplot(1, 2, 1)\n",
"original_lengths = train_data['Description'].apply(len)\n",
"plt.hist(original_lengths, bins=range(0, max(original_lengths) + 100, 100), color = 'blue')\n",
"plt.title('Original Text Length')\n",
"plt.xlabel('Text Length')\n",
"plt.ylabel('Frequency')\n",
"\n",
"# Subplot 2: Cleaned text length distribution\n",
"plt.subplot(1, 2, 2)\n",
"cleaned_lengths = train_data['Corpus_cleaning'].apply(len)\n",
"plt.hist(cleaned_lengths, bins=range(0, max(cleaned_lengths) + 100, 100), color = 'green')\n",
"plt.title('Cleaned Text Length')\n",
"plt.xlabel('Text Length')\n",
"plt.ylabel('Frequency')"
]
},
{
"cell_type": "markdown",
"id": "33215181-2ed5-4767-b845-97a971e2e5c5",
"metadata": {},
"source": [
"### TF-IDF Text vectorization"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "5d96cba1-b9ca-4f3c-b147-15c8194b9eb0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: total: 6.72 s\n",
"Wall time: 6.72 s\n"
]
}
],
"source": [
"%%time\n",
"tfidf = TfidfVectorizer()\n",
"X_train = tfidf.fit_transform(train_data['Corpus_cleaning'])\n",
"X_test = tfidf.transform(test_data['Corpus_cleaning'])"
]
},
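{
"cell_type": "markdown",
"id": "sketch-fit-transform-md",
"metadata": {},
"source": [
"`fit_transform` learns the vocabulary and IDF weights from the training text, while `transform` reuses them. Both matrices therefore share the same columns, and words that were not seen during fitting are dropped. A toy sketch with invented sentences:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-fit-transform-code",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"toy_train = [\"detective hunts killer\", \"friends plan wedding\"]\n",
"toy_test = [\"detective crashes wedding\"]   # 'crashes' never appears in toy_train\n",
"\n",
"toy_tfidf = TfidfVectorizer()\n",
"toy_X_train = toy_tfidf.fit_transform(toy_train)   # learns the vocabulary from toy_train\n",
"toy_X_test = toy_tfidf.transform(toy_test)         # reuses that vocabulary, ignores unseen words\n",
"\n",
"print(toy_X_train.shape, toy_X_test.shape)\n",
"print(toy_X_test.nnz)   # number of non-zero entries in the test row"
]
},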
{
"cell_type": "code",
"execution_count": 21,
"id": "f0da9268-e1ab-4f1c-991c-863a57d1962a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<54214x124210 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 2640592 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "bcc453e4-491d-443e-a40b-386b896028f2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<54200x124210 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 2578617 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_test"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "d886c9c9-7dd9-4f52-8de5-3cf3d8ae31e9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1 listening conversation doctor parents year old...\n",
"2 brother sister past incestuous relationship cu...\n",
"3 bus empties students field trip museum natural...\n",
"4 help unemployed father make ends meet edith tw...\n",
"5 films title refers recovered bodies ground zer...\n",
" ... \n",
"54210 short lived nbc live sitcom centered bonino wo...\n",
"54211 next generation exploitation sisters kapa bay ...\n",
"54212 bestaan echt stand comedy growing facing fears...\n",
"54213 walter vivian live country difficult time keep...\n",
"54214 labor day weekend intense hurricane ever make ...\n",
"Name: Corpus_cleaning, Length: 54214, dtype: object"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_data['Corpus_cleaning']"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "be778260-f5a9-4f04-9c22-46a68dc68b03",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(54214, 124210)"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.shape"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "f72a5939-82e7-4b35-9cef-5fb30652f42f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(54200, 124210)"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_test.shape"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "c9aeaea2-54c0-4ad0-a824-2ebf5975bf95",
"metadata": {},
"outputs": [],
"source": [
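"# Hold out 20% of the labelled training data for validation.\n",
"# X_train and X_test are reassigned here, so X_test no longer refers to the TF-IDF matrix of test_data.\n",
"X = X_train\n",
"y = train_data['Genre']\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)"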
]
},
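{
"cell_type": "markdown",
"id": "sketch-stratified-split-md",
"metadata": {},
"source": [
"With 27 imbalanced genres, a purely random split can leave rare genres under-represented in the held-out part. One option, which this notebook does not use, is the `stratify` argument of `train_test_split`, which keeps class proportions the same in both parts. A toy sketch with invented labels:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-stratified-split-code",
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"toy_X = list(range(20))\n",
"toy_y = [\"drama\"] * 16 + [\"war\"] * 4   # imbalanced toy labels (80% / 20%)\n",
"\n",
"# stratify=toy_y preserves the 80/20 class ratio in the held-out quarter\n",
"_, _, y_strat_train, y_strat_test = train_test_split(\n",
"    toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y\n",
")\n",
"\n",
"strat_counts = Counter(y_strat_test)\n",
"print(strat_counts)"
]
},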
{
"cell_type": "code",
"execution_count": 27,
"id": "3a25d932-d449-4973-9b59-db4aadd6b3df",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(43371, 124210)"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.shape"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "010f5b3f-1c20-4e69-9e18-beba617e2d2a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(43371,)"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_train.shape"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "b24b22f0-2f29-4c00-9c4d-937fe6a68678",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(10843, 124210)"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_test.shape"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "263637ee-e8a2-47ea-954c-1339bc77df66",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(10843,)"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_test.shape"
]
},
{
"cell_type": "markdown",
"id": "ec70ee6f-289a-4f81-9cca-92cdcfdaa974",
"metadata": {},
"source": [
"### Multinomial Naive Bayes"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "c6b1ebe4-478f-4969-8add-6b69c716785a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: total: 578 ms\n",
"Wall time: 598 ms\n"
]
},
{
"data": {
"text/plain": [
"MultinomialNB()"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")\n",
"\n",
"model_nb = MultinomialNB()\n",
"model_nb.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "e350fd44-174d-4051-900c-65fb360efaed",
"metadata": {},
"outputs": [],
"source": [
"y_nb_pred = model_nb.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "8d679600-19f7-4668-ba42-9ec8e85af419",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(10843,)"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_nb_pred.shape"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "0b6fd82b-ec90-41c6-a7d9-05fd1884d5ea",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import accuracy_score, classification_report"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "7a589db0-1f3f-4c87-84f7-c00cc6290661",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.44526422576777647"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"accuracy_score(y_test, y_nb_pred)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "f1cc2b45-d92b-4bf7-b2b7-b5346f313484",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" action 0.00 0.00 0.00 0\n",
" adult 0.00 0.00 0.00 0\n",
" adventure 0.00 0.00 0.00 0\n",
" animation 0.00 0.00 0.00 0\n",
" biography 0.00 0.00 0.00 0\n",
" comedy 0.04 0.61 0.07 93\n",
" crime 0.00 0.00 0.00 0\n",
" documentary 0.90 0.54 0.67 4462\n",
" drama 0.88 0.38 0.53 6284\n",
" family 0.00 0.00 0.00 0\n",
" fantasy 0.00 0.00 0.00 0\n",
" game-show 0.00 0.00 0.00 0\n",
" history 0.00 0.00 0.00 0\n",
" horror 0.00 0.00 0.00 0\n",
" music 0.00 0.00 0.00 0\n",
" musical 0.00 0.00 0.00 0\n",
" mystery 0.00 0.00 0.00 0\n",
" news 0.00 0.00 0.00 0\n",
" reality-tv 0.00 0.00 0.00 0\n",
" romance 0.00 0.00 0.00 0\n",
" sci-fi 0.00 0.00 0.00 0\n",
" short 0.00 0.50 0.00 4\n",
" sport 0.00 0.00 0.00 0\n",
" talk-show 0.00 0.00 0.00 0\n",
" thriller 0.00 0.00 0.00 0\n",
" war 0.00 0.00 0.00 0\n",
" western 0.00 0.00 0.00 0\n",
"\n",
" accuracy 0.45 10843\n",
" macro avg 0.07 0.08 0.05 10843\n",
" weighted avg 0.88 0.45 0.58 10843\n",
"\n"
]
}
],
"source": [
"print(classification_report(y_nb_pred, y_test))"
]
},
{
"cell_type": "markdown",
"id": "4ab23d24-49a3-4c86-90b1-caeb3e9760b2",
"metadata": {},
"source": [
"### Logistic Regression"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "9d305093-61c4-44cb-bd00-422ec38c1951",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: total: 4min 10s\n",
"Wall time: 1min 30s\n"
]
},
{
"data": {
"text/html": [
"LogisticRegression() In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org. "
],
"text/plain": [
"LogisticRegression()"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"model_lr = LogisticRegression()\n",
"model_lr.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "c3643a91-1c49-4089-a2a5-6b8eb64cb644",
"metadata": {},
"outputs": [],
"source": [
"y_lr_pred = model_lr.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "7e5e5766-1724-44c2-9c90-48e5dd161b93",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([' comedy ', ' drama ', ' documentary ', ..., ' drama ', ' short ',\n",
" ' horror '], dtype=object)"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_lr_pred"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "cf0d4a6f-e88d-42db-b62f-987ce19c1d0e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.5808355621138062"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"accuracy_score(y_lr_pred, y_test)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "46fa7225-49df-4782-8da1-41a3867c7254",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" action 0.22 0.56 0.32 103\n",
" adult 0.21 0.82 0.33 28\n",
" adventure 0.11 0.50 0.18 30\n",
" animation 0.02 0.67 0.04 3\n",
" biography 0.00 0.00 0.00 0\n",
" comedy 0.59 0.53 0.56 1602\n",
" crime 0.01 0.50 0.02 2\n",
" documentary 0.86 0.65 0.74 3520\n",
" drama 0.81 0.53 0.64 4108\n",
" family 0.05 0.50 0.09 14\n",
" fantasy 0.00 0.00 0.00 0\n",
" game-show 0.35 0.93 0.51 15\n",
" history 0.00 0.00 0.00 0\n",
" horror 0.55 0.68 0.61 348\n",
" music 0.38 0.69 0.49 78\n",
" musical 0.00 0.00 0.00 0\n",
" mystery 0.00 0.00 0.00 0\n",
" news 0.00 0.00 0.00 0\n",
" reality-tv 0.13 0.45 0.20 56\n",
" romance 0.00 0.00 0.00 4\n",
" sci-fi 0.15 0.51 0.24 43\n",
" short 0.31 0.51 0.39 636\n",
" sport 0.17 0.70 0.28 23\n",
" talk-show 0.10 0.57 0.17 14\n",
" thriller 0.12 0.49 0.20 78\n",
" war 0.00 0.00 0.00 0\n",
" western 0.67 0.96 0.79 138\n",
"\n",
" accuracy 0.58 10843\n",
" macro avg 0.21 0.44 0.25 10843\n",
" weighted avg 0.73 0.58 0.63 10843\n",
"\n"
]
}
],
"source": [
"print(classification_report(y_lr_pred, y_test))"
]
},
{
"cell_type": "markdown",
"id": "2cb55f36-66f0-4d3e-9c11-f17b0206de00",
"metadata": {},
"source": [
"### Support Vector CLasssifier"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "94690f84-cd13-487c-8db6-251a6582b6e0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: total: 1h 3min 31s\n",
"Wall time: 1h 3min 39s\n"
]
},
{
"data": {
"text/html": [
"SVC() In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org. "
],
"text/plain": [
"SVC()"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"model_svc = SVC()\n",
"model_svc.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "7be9abb3-ce85-4ad2-924a-6c235092697e",
"metadata": {},
"outputs": [],
"source": [
"y_svc_pred = model_svc.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "97fb4e91-84af-4536-a3c7-38d1d1927f4a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([' drama ', ' drama ', ' comedy ', ..., ' drama ', ' drama ',\n",
" ' horror '], dtype=object)"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_svc_pred"
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "68dc7a96-3ac2-4f7d-9c09-5e6623c82e0e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.5691229364567002"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"accuracy_score(y_svc_pred, y_test)"
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "5375e543-f5a0-40b6-a0d1-28f1d35d2465",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" action 0.13 0.64 0.22 55\n",
" adult 0.14 0.84 0.24 19\n",
" adventure 0.09 0.52 0.15 23\n",
" animation 0.01 1.00 0.02 1\n",
" biography 0.00 0.00 0.00 0\n",
" comedy 0.54 0.53 0.54 1472\n",
" crime 0.00 0.00 0.00 0\n",
" documentary 0.88 0.64 0.74 3692\n",
" drama 0.84 0.50 0.62 4578\n",
" family 0.05 0.73 0.10 11\n",
" fantasy 0.00 0.00 0.00 0\n",
" game-show 0.33 1.00 0.49 13\n",
" history 0.00 0.00 0.00 0\n",
" horror 0.52 0.72 0.60 310\n",
" music 0.25 0.77 0.38 47\n",
" musical 0.00 0.00 0.00 0\n",
" mystery 0.00 0.00 0.00 0\n",
" news 0.00 0.00 0.00 0\n",
" reality-tv 0.06 0.67 0.11 18\n",
" romance 0.00 0.00 0.00 0\n",
" sci-fi 0.11 0.67 0.19 24\n",
" short 0.22 0.60 0.33 389\n",
" sport 0.10 0.90 0.17 10\n",
" talk-show 0.04 0.75 0.07 4\n",
" thriller 0.06 0.50 0.11 40\n",
" war 0.05 1.00 0.10 1\n",
" western 0.66 0.96 0.78 136\n",
"\n",
" accuracy 0.57 10843\n",
" macro avg 0.19 0.52 0.22 10843\n",
" weighted avg 0.76 0.57 0.63 10843\n",
"\n"
]
}
],
"source": [
"print(classification_report(y_svc_pred, y_test))"
]
},
{
"cell_type": "markdown",
"id": "7ea51cf3-3f60-4f56-96bb-e3797be69f02",
"metadata": {},
"source": [
"### Comparision"
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "9ad5762e-9aee-4481-a8e4-048d5c5a0ccc",
"metadata": {},
"outputs": [],
"source": [
"acs_nb = accuracy_score(y_nb_pred, y_test)\n",
"acs_lr = accuracy_score(y_lr_pred, y_test)\n",
"acs_svc = accuracy_score(y_svc_pred, y_test)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "5f135980-6735-422f-aafd-8addca53b499",
"metadata": {},
"outputs": [],
"source": [
"total = acs_nb + acs_lr + acs_svc\n",
"# total = acs_nb + acs_lr\n",
"pie_part_1 = acs_nb/total\n",
"pie_part_2 = acs_lr/total\n",
"pie_part_3 = acs_svc/total"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "e604753c-47af-4288-895b-445ac0f04f94",
"metadata": {},
"outputs": [],
"source": [
"labels = ['Multinomial Naive Bayes', 'Logistic Regression', 'Support Vector Classifier']\n",
"sizes = [pie_part_1, pie_part_2, pie_part_3]\n",
"# labels = ['Multinomial Naive Bayes', 'Logistic Regression']\n",
"# sizes = [pie_part_1, pie_part_2]"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "2ad8e7eb-7c03-4bd0-aa19-92376414db7b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(10,5))\n",
"plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)\n",
"plt.legend(labels, loc=2)"
]
},
{
"cell_type": "markdown",
"id": "ba618643-69ea-472c-8170-80b7a4cba1fc",
"metadata": {},
"source": [
"### Adding the predicted values to a new dataframe with the target genre"
]
},
{
"cell_type": "code",
"execution_count": 51,
"id": "dd1ae82a-6fa7-48bf-b964-0e544d526fef",
"metadata": {},
"outputs": [],
"source": [
"test_data = test_data[:10843]\n",
"test_data['Predicted_Genre_nb'] = y_nb_pred"
]
},
{
"cell_type": "code",
"execution_count": 52,
"id": "6eb7e261-f71a-40e2-8719-5292e192ebb1",
"metadata": {},
"outputs": [],
"source": [
"test_data = test_data[:10843]\n",
"test_data['Predicted_Genre_lr'] = y_lr_pred"
]
},
{
"cell_type": "code",
"execution_count": 54,
"id": "98cb77d3-2cf7-47b4-8a07-3e2e6bdc77e2",
"metadata": {},
"outputs": [],
"source": [
"test_data = test_data[:10843]\n",
"test_data['Predicted_Genre_svm'] = y_svc_pred"
]
},
{
"cell_type": "code",
"execution_count": 59,
"id": "d77be342-435b-4b50-9138-8271b6eec944",
"metadata": {},
"outputs": [
{
"ename": "ValueError",
"evalue": "cannot insert Target_Genre, already exists",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m~\\AppData\\Local\\Temp\\ipykernel_12232\\3190150681.py\u001b[0m in \u001b[0;36m?\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[0mtest_data\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mto_csv\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'predicted_genres.csv'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mindex\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mFalse\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3\u001b[0m \u001b[1;31m# Add actual genre column to predicted dataFrame\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[0mextracted_col\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mtest_soln_data\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m\"Target_Genre\"\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 5\u001b[1;33m \u001b[0mtest_data\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0minsert\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;36m7\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m\"Target_Genre\"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mextracted_col\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\core\\frame.py\u001b[0m in \u001b[0;36m?\u001b[1;34m(self, loc, column, value, allow_duplicates)\u001b[0m\n\u001b[0;32m 4927\u001b[0m \u001b[1;34m\"'self.flags.allows_duplicate_labels' is False.\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4928\u001b[0m )\n\u001b[0;32m 4929\u001b[0m \u001b[1;32mif\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mallow_duplicates\u001b[0m \u001b[1;32mand\u001b[0m \u001b[0mcolumn\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4930\u001b[0m \u001b[1;31m# Should this be a different kind of error??\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 4931\u001b[1;33m \u001b[1;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34mf\"cannot insert {column}, already exists\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 4932\u001b[0m \u001b[1;32mif\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mis_integer\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mloc\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4933\u001b[0m \u001b[1;32mraise\u001b[0m \u001b[0mTypeError\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"loc must be int\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4934\u001b[0m \u001b[1;31m# convert non stdlib ints to satisfy typing checks\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;31mValueError\u001b[0m: cannot insert Target_Genre, already exists"
]
}
],
"source": [
"test_data.to_csv('predicted_genres.csv', index=False)\n",
"\n",
"# Add actual genre column to predicted dataFrame\n",
"extracted_col = test_soln_data[\"Target_Genre\"]\n",
"test_data.insert(7, \"Target_Genre\", extracted_col)"
]
},
{
"cell_type": "code",
"execution_count": 60,
"id": "51d3f471-b3c4-4aa0-86d6-01cdebf2c9d6",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" id \n",
" Title \n",
" Description \n",
" Corpus_cleaning \n",
" Predicted_Genre_nb \n",
" Target_Genre \n",
" Predicted_Genre_lr \n",
" Predicted_Genre_svm \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" 1 \n",
" Edgar's Lunch (1998) \n",
" L.R. Brane loves his life - his car, his apar... \n",
" brane loves life car apartment job especially ... \n",
" drama \n",
" NaN \n",
" comedy \n",
" drama \n",
" \n",
" \n",
" 1 \n",
" 2 \n",
" La guerra de papá (1977) \n",
" Spain, March 1964: Quico is a very naughty ch... \n",
" spain march quico naughty child three belongin... \n",
" drama \n",
" thriller \n",
" drama \n",
" drama \n",
" \n",
" \n",
" 2 \n",
" 3 \n",
" Off the Beaten Track (2010) \n",
" One year in the life of Albin and his family ... \n",
" one year life albin family shepherds north tra... \n",
" drama \n",
" comedy \n",
" documentary \n",
" comedy \n",
" \n",
" \n",
" 3 \n",
" 4 \n",
" Meu Amigo Hindu (2015) \n",
" His father has died, he hasn't spoken with hi... \n",
" father died hasnt spoken brother years serious... \n",
" documentary \n",
" documentary \n",
" horror \n",
" horror \n",
" \n",
" \n",
" 4 \n",
" 5 \n",
" Er nu zhai (1955) \n",
" Before he was known internationally as a mart... \n",
" known internationally martial arts superstar b... \n",
" documentary \n",
" drama \n",
" music \n",
" music \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id Title \\\n",
"0 1 Edgar's Lunch (1998) \n",
"1 2 La guerra de papá (1977) \n",
"2 3 Off the Beaten Track (2010) \n",
"3 4 Meu Amigo Hindu (2015) \n",
"4 5 Er nu zhai (1955) \n",
"\n",
" Description \\\n",
"0 L.R. Brane loves his life - his car, his apar... \n",
"1 Spain, March 1964: Quico is a very naughty ch... \n",
"2 One year in the life of Albin and his family ... \n",
"3 His father has died, he hasn't spoken with hi... \n",
"4 Before he was known internationally as a mart... \n",
"\n",
" Corpus_cleaning Predicted_Genre_nb \\\n",
"0 brane loves life car apartment job especially ... drama \n",
"1 spain march quico naughty child three belongin... drama \n",
"2 one year life albin family shepherds north tra... drama \n",
"3 father died hasnt spoken brother years serious... documentary \n",
"4 known internationally martial arts superstar b... documentary \n",
"\n",
" Target_Genre Predicted_Genre_lr Predicted_Genre_svm \n",
"0 NaN comedy drama \n",
"1 thriller drama drama \n",
"2 comedy documentary comedy \n",
"3 documentary horror horror \n",
"4 drama music music "
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_data.head()"
]
},
{
"cell_type": "code",
"execution_count": 62,
"id": "1ed4dbd1-94ba-4afe-9eeb-f6b8e8face3c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of samples Correctly predicted by Multinomial Naive Bayes Classifier: 2718\n",
"Number of samples Correctly predicted by Logistic Regression: 2143\n",
"Number of samples Correctly predicted by Support Vector Classifier: 2298\n"
]
}
],
"source": [
"correctly_predicted_values_nb = (test_data['Predicted_Genre_nb'] == test_data['Target_Genre']).sum()\n",
"correctly_predicted_values_lr = (test_data['Predicted_Genre_lr'] == test_data['Target_Genre']).sum()\n",
"correctly_predicted_values_svm = (test_data['Predicted_Genre_svm'] == test_data['Target_Genre']).sum()\n",
"\n",
"print(\"Number of samples Correctly predicted by Multinomial Naive Bayes Classifier:\", correctly_predicted_values_nb)\n",
"print(\"Number of samples Correctly predicted by Logistic Regression:\", correctly_predicted_values_lr)\n",
"print(\"Number of samples Correctly predicted by Support Vector Classifier:\", correctly_predicted_values_svm)"
]
},
{
"cell_type": "markdown",
"id": "3cac48dd-4e65-470e-aa02-6165630392a3",
"metadata": {},
"source": [
"### Creating the pkl file to predict the user input"
]
},
{
"cell_type": "code",
"execution_count": 63,
"id": "7b584fdb-faa9-4806-ac2e-d0b210a0428e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Models pickled successfully.\n"
]
}
],
"source": [
"import pickle\n",
"with open('tfidf.pkl', 'wb') as file:\n",
" pickle.dump(tfidf, file)\n",
"with open('model_lr.pkl', 'wb') as file:\n",
" pickle.dump(model_lr, file)\n",
"\n",
"print(\"Models pickled successfully.\")"
]
},
{
"cell_type": "markdown",
"id": "9dfa5382-ddcf-4cad-9079-2635bf616fc0",
"metadata": {},
"source": [
"#### Sample text data"
]
},
{
"cell_type": "code",
"execution_count": 64,
"id": "2a7553ad-681b-4875-b9dd-7697a6da0f8d",
"metadata": {},
"outputs": [],
"source": [
"# title = \"Edgar's Lunch (1998)\"\n",
"# discription = \"L.R. Brane loves his life - his car, his apartment, his job, but especially his girlfriend, Vespa. One day while showering, Vespa runs out of shampoo. L.R. runs across the street to a convenience store to buy some more, a quick trip of no more than a few minutes. When he returns, Vespa is gone and every trace of her existence has been wiped out. L.R.'s life becomes a tortured existence as one strange event after another occurs to confirm in his mind that a conspiracy is working against his finding Vespa.\""
]
},
{
"cell_type": "code",
"execution_count": 65,
"id": "bb798fc2-640b-481a-9736-9b0b37e339c3",
"metadata": {},
"outputs": [
{
"name": "stdin",
"output_type": "stream",
"text": [
"Enter movie Title Edgar's Lunch (1998)\n",
"Enter movie Discription L.R. Brane loves his life - his car, his apartment, his job, but especially his girlfriend, Vespa. One day while showering, Vespa runs out of shampoo. L.R. runs across the street to a convenience store to buy some more, a quick trip of no more than a few minutes. When he returns, Vespa is gone and every trace of her existence has been wiped out. L.R.'s life becomes a tortured existence as one strange event after another occurs to confirm in his mind that a conspiracy is working against his finding Vespa.\n"
]
}
],
"source": [
"title = input(\"Enter movie Title\")\n",
"discription = input(\"Enter movie Discription\")"
]
},
{
"cell_type": "code",
"execution_count": 66,
"id": "2233b21f-64a9-41b4-9489-8caf80fcaa5c",
"metadata": {},
"outputs": [],
"source": [
"new_data = [title, discription]"
]
},
{
"cell_type": "code",
"execution_count": 67,
"id": "78e0df09-37ec-4fbe-b727-70462424a8bf",
"metadata": {},
"outputs": [],
"source": [
"new_data_transformed = tfidf.transform(new_data)"
]
},
{
"cell_type": "code",
"execution_count": 68,
"id": "4b510fb2-f55e-46f8-a88a-a2eabf7618ac",
"metadata": {},
"outputs": [],
"source": [
"predictions = model_nb.predict(new_data_transformed)"
]
},
{
"cell_type": "code",
"execution_count": 69,
"id": "dcc2b1a6-e64a-47e1-8761-dbb05d45494d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Text: Predicted Genre: drama \n",
"Text: Predicted Genre: drama \n"
]
}
],
"source": [
"for text, prediction in zip(new_data, predictions):\n",
" print(f\"Text: Predicted Genre: {prediction}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5216b314-2d71-4d3f-967a-65a79c5f7e85",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}